Prunus persica Whole Genome Assembly v2.0 & Annotation v2.1 (v2.0.a1)
Overview
For use in publications, please cite the papers below:
Verde I, Jenkins J, Dondini L, Micali S, Pagliarani G, Vendramin E, Paris R, Aramini V, Gazza L, Rossini L, Bassi D, Troggio M, Shu S, Grimwood J, Tartarini S, Dettori MT, Schmutz J (2017) The Peach v2.0 release: high-resolution linkage mapping and deep resequencing improve chromosome-scale assembly and contiguity. BMC Genomics 18:225. doi:10.1186/s12864-017-3606-9
The International Peach Genome Initiative (2013) The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet 45:487-494. doi:10.1038/ng.2586
Please also cite the version (Peach v2.0.a1 (v2.1)) and any URL below.
About the Assembly
The peach genome sequencing project was initiated in 2008 by the International Peach Genome Initiative, an international consortium led by Italian and US scientists (Ignazio Verde, Albert Abbott, Jeremy Schmutz, Michele Morgante and Daniel Rokhsar). The first version (Peach v1.0) was released under the Fort Lauderdale Agreement in April 2010, and the results were published in Nature Genetics in 2013. The Peach v1.0 assembly was then improved using large community molecular mapping data obtained from three linkage maps. 7.3 Mb of previously unmapped sequence (11 scaffolds) was integrated into the eight peach pseudomolecules, and nine randomly oriented scaffolds (20 Mb) were correctly oriented. The large mapping dataset also made it possible to fix seven regions (12.2 Mb) that had been incorrectly positioned along the pseudomolecules because of misassembly. As a result of these mapping efforts, Peach v2.0 now has 99.2% of its sequence mapped and 97.9% oriented.
The base accuracy and contiguity were improved using contigs generated by an ABySS assembly of WGS Illumina reads (42x of 2x250 bp, 600 bp insert). Advancements include the correction of homozygous SNPs (859) and indels (1347) as well as minor assembly gaps (212 gaps closed with a gain of 25,199 bp). As a result, the contiguity of the Peach v2.0 was increased to a contig L50 of 255.4 kb (214.2 kb in Peach v1.0) and a contig N50 of 250 (294 in Peach v1.0).
The annotation of the repetitive fraction was also enhanced to include low-copy repeats and the complete sequence and location of 1,157 non-autonomous Helitrons.
Gene prediction and annotation were upgraded using transcript assemblies obtained from 2.2 billion RNA-seq reads from different peach tissues and organs. In total, after masking with the improved repeat annotation, 26,873 protein-coding genes were predicted in the Peach v2.1 annotation, 991 fewer than in Peach v1.0. Gene annotation was substantially enhanced by the prediction of almost 20,000 new isoforms.
Statistics
This release of Phytozome includes the JGI v2.1 gene annotation of assembly v2.0. The assembly comprises 225.7 Mb arranged in 8 pseudomolecules, with a small additional amount of mostly repetitive sequence in unmapped scaffolds.
Genome Size
Approximately 227.4 Mb arranged in 191 scaffolds
Approximately 224.6 Mb arranged in 2,525 contigs (~ 1.2% gap)
Scaffold N50 (L50) = 4 (27.4 Mbp)
Contig N50 (L50) = 250 (255.4 Kbp)
11 scaffolds larger than 50 Kbp, with 99.4% of the genome in scaffolds larger than 50 Kbp
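The statistics above report N50 as a count of sequences and L50 as a length, which is the reverse of the more common usage. As a minimal sketch of how such figures can be derived from a list of sequence lengths, the following Python example computes both values, along with the gap fraction implied by the scaffold and contig totals. The function names are illustrative and are not part of any released pipeline.

```python
def n50_stats(lengths):
    """Return (count, length) at the point where the longest sequences
    first cover half of the total assembly size.

    Following this page's convention, `count` corresponds to "N50" and
    `length` to "L50".
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("lengths must be non-empty")


def gap_fraction(scaffold_total, contig_total):
    """Fraction of the scaffold span that is not covered by contig sequence."""
    return (scaffold_total - contig_total) / scaffold_total


# Gap fraction from the totals listed above (227.4 Mb scaffolds, 224.6 Mb contigs).
print(f"{gap_fraction(227.4, 224.6):.1%}")
```

With the totals listed above, the gap fraction comes out at about 1.2%, in line with the reported value.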
Loci
26,873 loci containing protein-coding genes
Transcripts
47,089 protein-coding transcripts
Sequencing, Assembly, and Annotation
Gene Prediction and Locus Naming
Short reads (~1B single-end and ~1.2B paired-end Illumina RNA-seq reads of various lengths, ranging from 75 bp to 100 bp, and 3M 454 reads) from various labs around the globe were used to construct transcript assemblies (TAs) (Shu et al., manuscript in preparation). 106,848 transcript assemblies were constructed using PASA (Haas, 2003) from 383,498 sequences in total, consisting of the TAs above as well as Sanger ESTs, and 23,448 transcript assemblies were constructed from related-species ESTs (424,656 sequences). Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis (Arabidopsis thaliana), rice, grape, soybean and Swiss-Prot eukaryote proteins to the Prunus persica genome, soft-masked for repeats using RepeatMasker (Smit, 1996-2012), with up to 2 kb of extension on both ends unless extending into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but using ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001).
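The locus extension rule (up to 2 kb on both ends, unless extending into another locus on the same strand) can be sketched as follows. This is one possible reading, not the JGI implementation: it assumes the loci are 1-based inclusive intervals on a single chromosome and strand, sorted and non-overlapping, and that an extension is simply clipped at the neighbouring locus boundary. The name `extend_loci` and its parameters are hypothetical.

```python
def extend_loci(loci, chrom_len, flank=2000):
    """Extend each (start, end) locus by up to `flank` bp per side.

    Assumptions of this sketch: `loci` holds 1-based inclusive intervals
    on ONE chromosome and ONE strand, sorted by start and non-overlapping.
    """
    extended = []
    for i, (start, end) in enumerate(loci):
        # Leftmost allowed base: chromosome start, or just past the previous locus.
        left_limit = loci[i - 1][1] + 1 if i > 0 else 1
        # Rightmost allowed base: chromosome end, or just before the next locus.
        right_limit = loci[i + 1][0] - 1 if i + 1 < len(loci) else chrom_len
        extended.append((max(start - flank, left_limit),
                         min(end + flank, right_limit)))
    return extended


# Two neighbouring loci on a 9,000 bp sequence: each extension stops
# at the other locus or at the sequence end.
print(extend_loci([(5000, 6000), (7000, 8000)], chrom_len=9000))
```

Note that under this reading the flanks of two close neighbours may overlap each other in the space between them; only extension into the neighbouring locus itself is prevented.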
The highest-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to the MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed.
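The selection thresholds described above can be expressed as a small decision function. The following Python sketch is an illustration under stated assumptions, not the actual JGI code: it treats "homology coverage" as the same quantity as protein coverage, places a repeat overlap of exactly 20% (which the text leaves unspecified) in the lenient branch, and uses hypothetical function and parameter names.

```python
def cscore(blastp_score, mbh_blastp_score):
    """Cscore: ratio of a protein's BLASTP score to the mutual-best-hit BLASTP score."""
    return blastp_score / mbh_blastp_score


def keep_transcript(cscore_value, protein_cov, has_est, repeat_frac,
                    te_domain_frac=0.0):
    """Apply the selection thresholds to one PASA-improved transcript.

    All fractions are in the range 0..1. `repeat_frac` is the fraction of
    the CDS overlapping repeats; `te_domain_frac` is the fraction of the
    protein covered by Pfam TE domains.
    """
    # Final Pfam filter: drop models that are mostly TE domains.
    if te_domain_frac > 0.30:
        return False
    # Repeat-rich CDS: stricter homology requirements.
    if repeat_frac > 0.20:
        return cscore_value >= 0.9 and protein_cov >= 0.70
    # Otherwise: homology support or EST support is sufficient.
    # ASSUMPTION: exactly 20% overlap is handled here.
    return (cscore_value >= 0.5 and protein_cov >= 0.5) or has_est


# A repeat-rich model with only moderate homology support is rejected,
# even though it has EST coverage.
print(keep_transcript(0.8, 0.9, has_est=True, repeat_frac=0.5))
```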
References:
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654
Smit, A.F.A., Hubley, R. and Green, P. RepeatMasker Open-3.0. 1996-2011.
Yeh, R.-F., Lim, L.P. and Burge, C.B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.
Salamov, A.A. and Solovyev, V.V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10: 516-522.
Locus name and transcript name mapping from previous annotation version
The locus model name of a v1.0 gene is mapped to a corresponding v2.1 gene as an alias if 1) the v1.0 and v2.1 loci overlap uniquely and appear on the same chromosome, and 2) at least one pair of translated transcripts from the old and new loci are MBHs (mutual best hits) with at least 70% normalized identity in a BLASTP alignment (normalized identity is defined as the number of identical residues divided by the length of the longer sequence). 77.38% of v1.0 loci are mapped.
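As a worked illustration of the normalized identity criterion, the following Python sketch computes the value for one pair of translated transcripts and combines it with the other two conditions. The function names and boolean inputs are hypothetical; how unique overlap and mutual best hits are determined is outside the scope of this sketch.

```python
def normalized_identity(identical_residues, len_old, len_new):
    """Identical residues divided by the length of the longer protein."""
    return identical_residues / max(len_old, len_new)


def is_alias(unique_overlap_same_chrom, is_mutual_best_hit,
             identical_residues, len_old, len_new, min_identity=0.70):
    """True if a v1.0 locus name may be carried over as an alias."""
    if not (unique_overlap_same_chrom and is_mutual_best_hit):
        return False
    return normalized_identity(identical_residues, len_old, len_new) >= min_identity


# 280 identical residues between a 300 aa and a 420 aa protein:
# 280 / 420 is about 0.667, which is below the 70% threshold.
print(is_alias(True, True, 280, 300, 420))
```

Dividing by the longer sequence penalizes pairs in which one model is a fragment of the other, even when the aligned region itself is nearly identical.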
Contacts
Principal Collaborators:
JGI Contacts:
IGA Contacts:
GDR contact: Dorrie Main (WSU) (email: dorrie AT wsu DOT edu)
Associated Publications
Prunus persica annotation v2.1 on assembly v2.0 (v2.0.a1)
Transcripts
For Primary transcripts:
Gene model support (value is number of gene models):
Homology
Homology of the Prunus persica v2.0.a1 transcripts was determined by pairwise sequence comparison using the blastx algorithm against various protein databases. The results are available for download in Excel format. An expectation value cutoff of less than 1e-6 was used for Arabidopsis proteins and less than 1e-9 for the NCBI nr, UniProt Swiss-Prot, and UniProt TrEMBL databases. Protein Homologs
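Readers who want to reproduce the expectation value filtering on their own BLAST results could apply the cutoffs roughly as follows. This Python sketch assumes hits in the standard 12-column tabular BLAST output (e-value in column 11), which is an assumption about the input and not a description of the downloadable Excel files. The database keys in `CUTOFFS` are hypothetical labels.

```python
# Per-database e-value cutoffs taken from the text; the keys are illustrative.
CUTOFFS = {
    "arabidopsis": 1e-6,
    "nr": 1e-9,
    "swissprot": 1e-9,
    "trembl": 1e-9,
}


def filter_hits(tabular_lines, database):
    """Keep (query, subject, evalue) for hits below the database's cutoff.

    Assumes tab-separated lines in the standard 12-column tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore.
    """
    cutoff = CUTOFFS[database]
    kept = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])
        if evalue < cutoff:
            kept.append((fields[0], fields[1], evalue))
    return kept


# Made-up hit lines for illustration only.
example = [
    "t1\tp1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-10\t200",
    "t2\tp2\t40.0\t50\t30\t0\t1\t50\t1\t50\t1e-7\t50",
]
print(filter_hits(example, "nr"))
```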
Download
All assembly and annotation files are available for download by selecting the desired data type in the left-hand side bar. Each data type page will provide a description of the available files and links to download.
Assembly
The Prunus persica v2.0.a1 genome assembly files are available in FASTA and GFF3 formats. There are a total of 8 pseudomolecules and 183 scaffolds in this assembly of peach. Downloads
Gene Predictions
The Prunus persica v2.0.a1 genome gene prediction files are available in FASTA and GFF3 formats. Downloads
Functional Analysis
Functional annotation files for the Prunus persica v2.0.a1 genome are available for download below. The peach proteins were analyzed using InterProScan in order to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS). Downloads
SNPs
IRSC 9K peach SNPs anchored to Prunus persica whole genome v1.0 assembly
Markers
The alignment tool 'BLAT' was used to map Prunus genetic marker sequences to the Peach genome v2.0.a1. Markers required 90% identity over 97% of their length. For SSRs and RFLPs, the gap size was restricted to 1000 bp or less, with fewer than 2 gaps. The available files are in FASTA and GFF3 format. You can also find original information about these markers on the Prunus persica Whole Genome v1.0 Assembly & Annotation detail page. Downloads
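The marker filtering criteria can be sketched as a simple predicate over alignment summary values. The Python example below is one interpretation and not the pipeline actually used: it assumes identity is computed as matches over aligned bases and coverage as aligned bases over marker length, and the function and parameter names are hypothetical.

```python
def marker_hit_passes(matches, mismatches, marker_len, gap_sizes, marker_type):
    """Apply the identity, coverage and gap criteria to one alignment.

    ASSUMPTIONS: identity = matches / aligned bases, and
    coverage = aligned bases / marker length. `gap_sizes` lists the size
    in bp of each gap in the alignment.
    """
    aligned = matches + mismatches
    if aligned == 0 or marker_len <= 0:
        return False
    identity = matches / aligned
    coverage = aligned / marker_len
    if identity < 0.90 or coverage < 0.97:
        return False
    # The gap restriction is stated only for SSRs and RFLPs.
    if marker_type in {"SSR", "RFLP"}:
        return len(gap_sizes) < 2 and all(size <= 1000 for size in gap_sizes)
    return True


# An SSR aligned with a single 500 bp gap passes; two gaps would not.
print(marker_hit_passes(98, 1, 100, [500], "SSR"))
```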